Role-based similarity in directed networks 
The widespread relevance of increasingly complex networks requires methods to extract meaningful
coarse-grained representations of such systems. For undirected graphs, standard community
detection methods use criteria largely based on density of connections to provide such representa- 
tions. We propose a method for grouping nodes in directed networks based on the role of the nodes 
in the network, understood in terms of patterns of incoming and outgoing flows. The role groupings 
are obtained through the clustering of a similarity matrix, formed by the distances between feature 
vectors that contain the number of in and out paths of all lengths for each node. Hence nodes 
operating in a similar flow environment are grouped together although they may not themselves be 
densely connected. Our method, which includes a scale factor that reveals robust groupings based 
on increasingly global structure, provides an alternative criterion to uncover structure in networks 
where there is an implicit flow transfer in the system. We illustrate its application to a variety of 
data from ecology, world trade and cellular metabolism. 
The recent surge of interest in the study of complex
networks spans diverse disciplines, from physics and com- 
puter science to biology and the social sciences. Classic 
examples include the Internet, protein interaction networks,
food webs and social groupings, among many others
[1-4]. A network is a collection of nodes connected by
edges that represent interactions. In many instances, the 
edges have an associated direction or weight, but the vast
majority of research to date has focused on unweighted
and undirected graphs. Network representations have
the advantage that they naturally capture properties at
the system level starting from individual constituents. 
However, with the growth of computational capability 
and high-throughput technologies, network representa- 
tions quickly become so complex as to lack intelligibility. 

A key challenge in this area is the development of meth- 
ods to obtain simplified reduced representations of com- 
plex networks in terms of subgraphs or communities, i.e., 
meaningful groupings of nodes that are significantly re- 
lated. For instance, nodes are likely to belong together 
if they are part of a tightly-knit group with many con- 
nections within the group and fewer to external nodes. 
The flurry of research on clustering of networks and com- 
munity detection [5] has led to the rediscovery of classic 
results in graph partitioning, and to the development of 
new measures such as modularity [6] and various spec- 
tral algorithmic procedures [7, 8]. Most methods have 
focused on undirected networks, where it is natural to 
consider structural metrics based on the density of intra- 
and inter-community edges. However, there is a large 
class of networks where the directionality of the edges 
is essential and where an analysis based on undirected 
graphs risks missing key properties of the system. Exam- 
ples include social networks, food webs, the world wide 
web and systems involving causality, such as metabolic 
and genetic networks. Only recently, extensions of notions
of modularity for directed graphs have been proposed
[9, 10], as well as other measures based on diffusion
dynamics that can be applied to both directed and undirected
graphs [11, 12].

Here, we introduce an alternative measure for the 
grouping of nodes in directed networks. Given that the 
defining characteristic of directed graphs is the implicit 
existence of flows, we propose to group nodes according 
to their role in the network, defined in terms of the overall 
pattern of incoming and outgoing flows. Essentially, the 
profile of paths for each node is a vector that is computed 
from the powers of the adjacency matrix weighted with 
a scale parameter to yield a similarity matrix, defined by 
the distances between such node vectors. This matrix 
is then clustered to find groupings of nodes with similar 
profiles of reachability flows at all lengths. For instance, 
in our analysis, all nodes that are sources are found to be 
similar to each other, while sinks are grouped together. 
In between these extremes, nodes are grouped accord- 
ing to a quantitative measure that reflects the mixture of 
'hub' vs. 'authority' characteristics of each node with re- 
spect to all paths in the graph. Our definition is inspired 
by a vast array of literature from the social sciences, 
dealing with structural and regular equivalence [13-16], 
and from computer science, where alternative algorith- 
mic measures of similarity have been considered [17-20]. 

Our methodology can be used to unveil groups distinct
from those found by community detection algorithms based
on density of connections. Indeed, nodes that play similar
roles may be only weakly connected. For instance,
in a food web, two predators are not likely to be linked
directly although both perform the same function and 
would be canonically grouped within the same trophic 
level. Hence, role similarity can uncover a coarse-grained 
functional representation for networks where the domi- 
nating feature is the transfer of an underlying quantity 
(e.g., information, energy or matter). This role-based
representation is relevant in fields such as ecology, economics,
social sciences and cellular metabolism, where
it can aid in the assignment of a putative function to 
uncharacterised nodes and in establishing functional re- 
lations between seemingly distant network elements. 
FIG. 1. Role clustering for the directed path graph in (A).
The construction of the flow matrix X (shown in (B) for $\beta = 1$)
is followed by the construction of the similarity matrix
Y (shown in grayscale in (C) for $\beta \ll 1$ and $\beta = 1$). The
resulting role groupings of the nodes are shown in (D).

The measure is defined as follows. Consider a directed 
graph with N nodes and adjacency matrix A, which is 
in general asymmetric. The number of outgoing paths of 
length k for node i is given by the i-th coordinate of the 
vector $A^k \mathbf{1}$, where $\mathbf{1}$ is the $N \times 1$ vector of ones. Similarly,
the number of incoming paths of length k for node i
is $[(A^T)^k \mathbf{1}]_i$. Note that the case k = 1 corresponds to the
out-degree and in-degree which, from this perspective, 
represent the number of paths of length one originating 
or terminating at the node. 
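
The path counts described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the four-node toy graph and the helper names `mat_vec`, `transpose` and `path_counts` are hypothetical and do not come from the original analysis.

```python
def transpose(M):
    """Transpose a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def path_counts(A, k):
    """Return (out_paths, in_paths): the number of paths of length k
    that start at, respectively end at, each node."""
    n = len(A)
    At = transpose(A)
    out_paths = [1] * n
    in_paths = [1] * n
    for _ in range(k):
        out_paths = mat_vec(A, out_paths)  # builds A^k 1
        in_paths = mat_vec(At, in_paths)   # builds (A^T)^k 1
    return out_paths, in_paths


# Hypothetical toy graph with edges 0->1, 0->2, 1->2, 2->3.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

out1, in1 = path_counts(A, 1)  # out-degrees and in-degrees
out2, in2 = path_counts(A, 2)  # paths of length two
print(out1, in1)
print(out2, in2)
```

For k = 1 the two vectors are simply the out-degrees and in-degrees, as noted in the text; for k = 2 node 0 starts two paths (0→1→2 and 0→2→3) and node 3 ends two paths (0→2→3 and 1→2→3).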

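We now construct a matrix that compiles the incoming
and outgoing paths of all lengths up to $k_{\max}$ by appending
the column vectors indexed by path length and scaled
by the factors $\beta^k$:

$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} = \left[\, \cdots \; (\beta A^T)^k \mathbf{1} \; \cdots \;\middle|\; \cdots \; (\beta A)^k \mathbf{1} \; \cdots \,\right].$$
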

Here, $\beta = \alpha/\lambda_1$, with $\lambda_1$ the largest eigenvalue of the
adjacency matrix and $0 < \alpha < 1$. The parameter $\alpha$ is a
scale factor that allows us to tune the weight of the local
environment (short paths) relative to the global network
structure (long paths). The presence of the factors $\beta^k$
ensures the convergence of the sequence of the columns
due to the asymptotic limit $\lim_{k \to \infty} \|A^{k+1}\mathbf{1}\| / \|A^{k}\mathbf{1}\| = \lambda_1$ [20].
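
As a hedged sketch of this construction, the snippet below estimates the largest eigenvalue from the growth rate of the path counts and then assembles the scaled columns into a flow matrix, with the incoming block first and the outgoing block second. The function names, the three-node example graph, and the choices $\alpha = 0.5$ and $k_{\max} = 10$ are assumptions made for illustration. The growth-rate estimate is only expected to converge for strongly connected, aperiodic graphs.

```python
def transpose(M):
    """Transpose a square matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def estimate_lambda1(A, iterations=100):
    """Estimate the largest eigenvalue as the asymptotic ratio of
    total path counts of successive lengths."""
    v = [1.0] * len(A)
    ratio = 0.0
    for _ in range(iterations):
        w = mat_vec(A, v)
        total = sum(w)
        if total == 0.0:
            return 0.0  # no longer paths exist (acyclic graph)
        ratio = total / sum(v)
        v = [x / total for x in w]  # rescale to avoid overflow
    return ratio


def flow_matrix(A, beta, k_max):
    """Rows are node profiles: scaled in-path counts for k = 1..k_max,
    followed by scaled out-path counts for k = 1..k_max."""
    n = len(A)
    At = transpose(A)
    in_v = [1.0] * n
    out_v = [1.0] * n
    in_cols, out_cols = [], []
    for _ in range(k_max):
        in_v = [beta * x for x in mat_vec(At, in_v)]   # (beta A^T)^k 1
        out_v = [beta * x for x in mat_vec(A, out_v)]  # (beta A)^k 1
        in_cols.append(in_v)
        out_cols.append(out_v)
    cols = in_cols + out_cols
    return [[col[i] for col in cols] for i in range(n)]


# Hypothetical strongly connected graph: 0->1, 0->2, 1->2, 2->0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
lam1 = estimate_lambda1(A)
alpha = 0.5  # illustrative scale factor, 0 < alpha < 1
X = flow_matrix(A, alpha / lam1, k_max=10)
print(lam1, len(X), len(X[0]))
```

Because each column is scaled by $\beta^k$ with $\beta < 1/\lambda_1$, the entries shrink geometrically with k, so truncating at a finite $k_{\max}$ discards only small contributions.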

Each row vector of X contains the flow profile of a node 
in terms of the scaled number of incoming and outgoing 
paths of all lengths starting and ending at that node (see 
Fig. 1). Our criterion to group nodes together is that 
they have similar flow profiles. This can be quantified 
via a distance between the vectors $x_i$. A simple choice
of metric is the cosine distance, which leads to the symmetric
similarity matrix Y defined by:

$$Y_{ij} = \frac{x_i x_j^T}{\|x_i\|_2 \, \|x_j\|_2}, \qquad (1)$$



where element $Y_{ij}$ provides a normalized measure of the
closeness of the flow profiles of nodes i and j. Dissimilar
flow profiles have a similarity value close to zero, while
similar flow profiles lead to a similarity close to one. The
groupings of nodes are obtained from the clustering of 
this similarity matrix: nodes in the same cluster have 
similar flow profiles and can be considered to play a sim- 
ilar role in terms of the flow in the directed graph. It is 
important to remark that the clustering of the similarity 
matrix can be performed with any of a variety of meth- 
ods available for weighted symmetric graphs. In what 
follows, we have chosen a spectral algorithm based on a 
Multiple Normalized Cut [8, 21] together with Gaussian 
preprocessing of the weights. However, the results do not 
depend heavily on the choice of clustering algorithm. 
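
A minimal sketch of the similarity step is given below, assuming the cosine form of the similarity between row profiles. The input matrix is a small hand-written example (the profiles of a three-node directed path with $\beta = 1$ and $k_{\max} = 2$), and the convention of assigning zero similarity to an all-zero profile is an assumption of this sketch. The clustering step itself (for instance a normalized-cut spectral method) is not reproduced here.

```python
import math


def cosine_similarity_matrix(X):
    """Return Y with Y[i][j] = cosine of the angle between rows i and j.
    Rows of zeros (isolated nodes) get similarity 0 by convention."""
    n = len(X)
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0.0 and norms[j] > 0.0:
                dot = sum(a * b for a, b in zip(X[i], X[j]))
                Y[i][j] = dot / (norms[i] * norms[j])
    return Y


# Flow profiles of the path 0 -> 1 -> 2 (in-paths k=1,2 | out-paths k=1,2).
X = [
    [0.0, 0.0, 1.0, 1.0],  # source: only outgoing paths
    [1.0, 0.0, 1.0, 0.0],  # intermediate node
    [1.0, 1.0, 0.0, 0.0],  # sink: only incoming paths
]
Y = cosine_similarity_matrix(X)
for row in Y:
    print([round(y, 3) for y in row])
```

In this example the source and the sink have orthogonal profiles, so their similarity is zero, while each is equally similar to the intermediate node.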

Our procedure is illustrated in Figure 1 through the 
simple example of a path graph. We scan the groupings 
as a function of the scale factor $\alpha$ so as to reveal role
groupings based on an increasingly global flow structure.
When $\alpha$ is small, short path lengths dominate and nodes
are classified in terms of their local properties. In the
limit $\alpha \to 0$, only paths of length one contribute to the
clustering, which is equivalent to classifying nodes according
to their in- and out-degree. In this limit, the
nodes of the path graph (Fig. 1) are classified into three
groups according to their role: input → intermediate →
output, i.e., all the internal nodes are identical based
on their short-scale patterns of in- and out-flows. As
$\alpha$ grows towards 1, longer paths are given increasingly
more weight and the global flow structure of the network
is taken into account to cluster the nodes. For the path
graph in Fig. 1, taking into account the global structure
(in this case, the presence of end nodes) means that each 
node is classified as having a different role. In some sim- 
plified examples, the groupings are identical at all values 
of the scale parameter, as in the test examples in [9] (not 
shown) in which the nodes can be distinguished based 
on their in- and out-degrees. Similarly, the flow example 
presented in [11] is also reduced to a meaningful representation
of two groups (not shown). In general, however,
robust nontrivial clusterings are found for values
of $\alpha \to 1$ in more complex examples.
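
The behaviour of the path graph at the two ends of the scale can be checked with a short sketch. It assumes a five-node directed path and sets the scaling $\beta$ directly, since the adjacency matrix of a directed path has no nonzero eigenvalue to normalize by. The helper names and the values $\beta = 0.01$ and $\beta = 1$ are illustrative choices, not taken from the original analysis.

```python
import math


def path_graph(n):
    """Adjacency matrix of the directed path 0 -> 1 -> ... -> n-1."""
    return [[1.0 if j == i + 1 else 0.0 for j in range(n)] for i in range(n)]


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def flow_matrix(A, beta, k_max):
    """Rows are node profiles: scaled in-path counts, then out-path counts."""
    n = len(A)
    At = [list(col) for col in zip(*A)]
    in_v, out_v = [1.0] * n, [1.0] * n
    in_cols, out_cols = [], []
    for _ in range(k_max):
        in_v = [beta * x for x in mat_vec(At, in_v)]
        out_v = [beta * x for x in mat_vec(A, out_v)]
        in_cols.append(in_v)
        out_cols.append(out_v)
    cols = in_cols + out_cols
    return [[col[i] for col in cols] for i in range(n)]


def cosine_similarity_matrix(X):
    n = len(X)
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    Y = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0.0 and norms[j] > 0.0:
                dot = sum(a * b for a, b in zip(X[i], X[j]))
                Y[i][j] = dot / (norms[i] * norms[j])
    return Y


A = path_graph(5)
# Small beta: paths of length one dominate (local view).
Y_local = cosine_similarity_matrix(flow_matrix(A, 0.01, k_max=4))
# beta = 1: all path lengths weigh equally (global view).
Y_global = cosine_similarity_matrix(flow_matrix(A, 1.0, k_max=4))

print("interior pair, local :", round(Y_local[1][2], 4), round(Y_local[1][3], 4))
print("interior pair, global:", round(Y_global[1][2], 4), round(Y_global[1][3], 4))
print("source vs sink       :", Y_local[0][4])
```

With the small scaling, the three interior nodes are almost indistinguishable from one another while the source and sink remain orthogonal, consistent with the three-group classification. With $\beta = 1$ the similarity between interior nodes 1 and 2 drops to 0.75 and between nodes 1 and 3 to 0.5, so the interior nodes begin to separate.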

We have used our method to analyze several types of 
networks from real data where flows are intrinsic to the 
system. Below, we present three examples taken from 
the Social Sciences, Ecology and Biochemistry. Figure 2 
shows the role-classification in a world-trade network of 
manufacture of metals in 1994 [13, 22]. A well-established 
concept in this literature is that the world economy can 
be broken down into a core, a semi-periphery and a pe- 
riphery. Dominant core countries tend to specialize in 
high-tech production requiring capital, whereas peripheral
countries supply raw materials and labor-intensive
products. As a consequence, there tend to be many
connections within the core but few trade connections be- 
tween members of the periphery. Figure 2 shows that our 
algorithm finds a robust classification into three groups 
that can be ascribed to this conceptual framework. 
FIG. 2. World trade network of manufacture of metals. Our
algorithm finds a robust grouping in which each country is
classified as core, semi-periphery or periphery. The color-coded
reduced representation is shown in (A).



Our second example is the food web of St Marks river 
in Florida [23, 24]. This ecological network is under- 
pinned by an underlying flow of carbon (i.e., assimilated 
matter and energy). Importantly, trophic levels are not 
defined by the density of internal connections but rather 
by their role (or position) within the flows of the network. 
Figure 3 shows that the groupings produced by our al- 
gorithm detect trophic levels with the expected content. 
Carbon producers such as algae and bacteria are grouped 
together with other basal taxa as sources in the network. 
Above these are small bottom-feeding fish such as Spot 
and Tongue fish, as well as some benthic invertebrates. 
One more level up are most fish, along with some preda- 
tory invertebrates such as shrimp and omnivorous crabs. 
The top level consists of all birds, large predatory fish 
and other sinks of the system. 

Our final example comes from metabolic networks, an 
area where identifying functional modules is crucial [25]. 
These networks have been analyzed using methods for 
undirected graphs, thus ignoring the inherent direction- 
ality of metabolite transformation in cellular pathways. 
Figure 4 shows our results for the largest connected com- 
ponent of the widely studied metabolic network of E. coli 
developed by Ma and Zeng [26] from the Kyoto Encyclo- 
pedia of Genes and Genomes (KEGG) [27]. Our results 
reveal the existence of a core, semi-peripheral and pe- 
ripheral organization, a structure that is common among 
metabolic networks of many species and has been hypoth- 
esized previously [28], but with a finer, more nuanced 
substructure. The network divides naturally into six sig- 
nificant groups including two types of input nodes, two 
types of cores, a set of intermediates and one group of 
outputs (Fig. 4B). We have used the extensive biological 
and functional characterization of metabolites in KEGG 
to examine the significance of these groups. Figure 4C 
shows that the metabolites in the core groups, and specif- 
ically those in Core 2, have a high metabolic importance, 
FIG. 3. Analysis of an ecological example, the St Marks
food web, showing the assignment of each species to role groupings
akin to trophic levels (A), with the reduced depiction (B)
indicating the flow of carbon through the network.



measured as the relative participation in different path- 
ways. Hence these metabolites can be seen as forming 
the reservoir of cellular building blocks that are key to 
the function and interconnection of pathways in the cell. 
In addition, we have characterized the KEGG pathway 
types in terms of our roles. Figure 4D shows, for instance, 
that central pathways such as carbohydrate and energy 
metabolism have an over-representation of core groups 
while, on the other hand, core groups are not involved in 
signaling pathways, which are dominated by a direct flow 
from input through intermediates to outputs. Unsurpris- 
ingly, biosynthetic internal pathways contain no inputs 
or outputs as they are used to generate intermediate and 
core metabolites. The detailed analysis of this functional 
classification will be presented elsewhere. 

We have introduced here a conceptual basis for the 
grouping of nodes in directed networks based upon their 
role in the network, as established by the patterns of 
incoming and outgoing flows of all lengths. Our mea- 
sure can be computed by taking successive powers of 
the adjacency matrix and convergence is ensured natu- 
rally within our definition. This measure formalizes and 
combines concepts present in the social network litera- 
ture (e.g., structural equivalence) with ideas of similarity 
drawn from computer science. In fact, one can show that 
the similarity matrix Y can also be calculated iteratively 
based upon node similarity by computing the normalized 
FIG. 4. Role-based clustering of the metabolic network of
E. coli (largest connected component with N = 563 nodes): 
force-based Kamada-Kawai layout of the network (A) and 
reduced representation (B) showing the position of each role 
grouping. (C) The metabolic importance of both core groups 
is above average. (D) The distribution of roles with respect 
to KEGG pathway classification shows a markedly different 
contribution from each group. 



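sum of the convergent terms of:

$$Y^{\mathrm{out}}_{k+1} = A \left[ J + \left( \frac{\alpha}{\lambda_1} \right)^{2} Y^{\mathrm{out}}_{k} \right] A^T,$$

$$Y^{\mathrm{in}}_{k+1} = A^T \left[ J + \left( \frac{\alpha}{\lambda_1} \right)^{2} Y^{\mathrm{in}}_{k} \right] A,$$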

where J is the matrix of ones and $Y_0$ is the matrix of
zeros. This algorithmic formulation allows for simplified
updated computations in a format equivalent to, yet functionally
distinct from, other methods [17, 20]. In summary,
our approach provides an alternative method to community
detection algorithms for the simplification and abstraction
of complex networks where directionality and
flow transfer (rather than density of connections) are the
fundamental ingredients in the description of the system.
Our application to examples from a variety of fields highlights
the applicability of such ideas across disciplines.
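
The equivalence between the iterative formulation and the path-count construction can be checked numerically. The sketch below assumes the recursion takes the form $Y_{k+1} = A\,(J + \beta^2 Y_k)\,A^T$ for outgoing paths and $Y_{k+1} = A^T (J + \beta^2 Y_k)\,A$ for incoming paths, with $\beta = \alpha/\lambda_1$ and $Y_0 = 0$. Under that assumption, k iterations reproduce the unnormalized product $X X^T$ for $k_{\max} = k$ up to a factor $\beta^2$, which cancels in the cosine normalization. The toy graph, the value $\beta = 0.5$ and all function names are hypothetical.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(P, Q):
    Qt = transpose(Q)
    return [[sum(p * q for p, q in zip(row, col)) for col in Qt] for row in P]


def iterate_similarity(A, beta, steps):
    """Run the assumed recursion for the out- and in-path terms."""
    n = len(A)
    At = transpose(A)
    Y_out = [[0.0] * n for _ in range(n)]
    Y_in = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        # J is the all-ones matrix, so J + beta^2 Y adds 1 to every entry.
        inner_out = [[1.0 + beta ** 2 * y for y in row] for row in Y_out]
        inner_in = [[1.0 + beta ** 2 * y for y in row] for row in Y_in]
        Y_out = mat_mul(mat_mul(A, inner_out), At)
        Y_in = mat_mul(mat_mul(At, inner_in), A)
    return Y_out, Y_in


def gram_from_flow_profiles(A, beta, k_max):
    """Compute X X^T directly from the scaled path-count columns."""
    n = len(A)
    At = transpose(A)
    out_v, in_v = [1.0] * n, [1.0] * n
    G = [[0.0] * n for _ in range(n)]
    for _ in range(k_max):
        out_v = [beta * sum(a * x for a, x in zip(row, out_v)) for row in A]
        in_v = [beta * sum(a * x for a, x in zip(row, in_v)) for row in At]
        for i in range(n):
            for j in range(n):
                G[i][j] += out_v[i] * out_v[j] + in_v[i] * in_v[j]
    return G


# Hypothetical toy graph with edges 0->1, 0->2, 1->2, 2->3.
A = [
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
beta, steps = 0.5, 4
Y_out, Y_in = iterate_similarity(A, beta, steps)
G = gram_from_flow_profiles(A, beta, steps)
max_diff = max(
    abs(G[i][j] - beta ** 2 * (Y_out[i][j] + Y_in[i][j]))
    for i in range(4) for j in range(4)
)
print(max_diff)
```

The iterative form never stores the flow matrix explicitly, which is one reason such a recursion can be convenient when the similarity has to be updated repeatedly.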
KC is supported by the Wellcome Trust. 